[SPARK-22799][ML] Bucketizer should throw exception if single- and multi-column params are both set by mgaido91 · Pull Request #19993 · apache/spark

mgaido91 · 2017-12-15T18:23:42Z

What changes were proposed in this pull request?

Currently, transformers that support both single- and multi-column params behave inconsistently when both are set: in some cases an exception is thrown, in others only a warning is logged. In this discussion https://issues.apache.org/jira/browse/SPARK-8418?focusedCommentId=16275049&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16275049, the decision was to throw an exception.

The PR throws an exception in Bucketizer, instead of logging a warning.

How was this patch tested?

modified UT

SparkQA · 2017-12-15T19:28:28Z

Test build #84970 has finished for PR 19993 at commit 8f3581c.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91 · 2017-12-15T20:14:59Z

Jenkins, retest this please

SparkQA · 2017-12-15T21:18:53Z

Test build #84978 has finished for PR 19993 at commit 8f3581c.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

WeichenXu123 · 2017-12-18T03:03:28Z

mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala

+    if (isSet(inputCols) && isSet(inputCol) || isSet(inputCols) && isSet(outputCol) ||
+      isSet(inputCol) && isSet(outputCols)) {
+      throw new IllegalArgumentException("Both `inputCol` and `inputCols` are set, `Bucketizer` " +
+        "only supports setting either `inputCol` or `inputCols`.")


It would be better to add a few more checks here:
in the single-column case, only the splits param should be set;
in the multi-column case, only the splitsArray param should be set, and we should also check that
inputCols.length == outputCols.length == splitsArray.length

Thanks. I will add these checks while making the changes requested by @hhbyyh.
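
The checks suggested in this thread can be sketched in plain Scala as follows. This is only an illustration under stated assumptions: `BucketizerParams` and `BucketizerParamChecks` are hypothetical stand-ins that model each param as an `Option`, not Spark's `Params` API, and the error messages are made up.

```scala
// Hypothetical stand-in for the relevant Bucketizer params; None means "not set".
final case class BucketizerParams(
    inputCol: Option[String] = None,
    outputCol: Option[String] = None,
    splits: Option[Array[Double]] = None,
    inputCols: Option[Array[String]] = None,
    outputCols: Option[Array[String]] = None,
    splitsArray: Option[Array[Array[Double]]] = None)

object BucketizerParamChecks {

  /** Throws IllegalArgumentException when one of the suggested checks fails. */
  def validate(p: BucketizerParams): Unit = {
    val singleSet = p.inputCol.isDefined || p.outputCol.isDefined || p.splits.isDefined
    val multiSet = p.inputCols.isDefined || p.outputCols.isDefined || p.splitsArray.isDefined

    // Single-column params (including splits) must not be mixed with multi-column ones.
    require(!(singleSet && multiSet),
      "single-column and multi-column params cannot be set simultaneously")

    // Multi-column case: one output column and one splits array per input column.
    for (in <- p.inputCols; out <- p.outputCols; sp <- p.splitsArray) {
      require(in.length == out.length && out.length == sp.length,
        "inputCols, outputCols and splitsArray must have the same length")
    }
  }
}
```

Using `require` keeps the failure type at `IllegalArgumentException`, which matches the exception type thrown in the diff above.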

hhbyyh

@mgaido91, Thanks for the PR.

Since we have several classes that need the improvement and more classes will support HasInputCols, I feel like we should develop some code that can be shared by other classes.
We also need common unit test methods for the case.

mgaido91 · 2017-12-18T07:29:15Z

@hhbyyh thanks for the review. I see that for some classes there are ongoing PRs. Thus I cannot change them now in order to have a common place and a common test. Should I wait for those PRs to be merged then? Or were you suggesting something different?
Thanks.

hhbyyh · 2017-12-18T22:47:11Z

I would suggest developing the common infrastructure and unit tests first; other PRs can then adopt them, or we can send follow-up fixes.

cc @MLnick for advice.

viirya · 2017-12-19T08:04:36Z

mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala

@@ -140,10 +140,10 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
   * by `inputCol`. A warning will be printed if both are set.


We should modify the document above too.

mgaido91 · 2017-12-19T18:12:18Z

@hhbyyh I created a common infrastructure. Please let me know if I have to modify it. Meanwhile, I'd like to discuss where to put the common UTs: do you have any specific idea about the right place? Thanks.

SparkQA · 2017-12-19T18:25:53Z

Test build #85123 has finished for PR 19993 at commit bb0c0d2.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-12-19T18:30:25Z

Test build #85124 has finished for PR 19993 at commit 9f56800.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-12-19T19:09:14Z

Test build #85128 has finished for PR 19993 at commit f593f5b.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-12-19T20:16:06Z

Test build #85130 has finished for PR 19993 at commit 2ecdc73.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

hhbyyh

Thanks for the update.

I like the current implementation that can be shared by other classes. For the test case, I think it makes sense to add something in checkParams.

cc @MLnick @jkbradley @srowen

hhbyyh · 2017-12-19T20:32:43Z

mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala

-    } else {
-      false
+    ParamValidators.assertColOrCols(this)
+    if (isSet(inputCol) && isSet(splitsArray)) {


I noticed isBucketizeMultipleColumns is invoked in many places, so maybe we can put the checks elsewhere, such as in transformSchema. That would also keep the code consistent with the function name.

hhbyyh · 2017-12-19T20:42:57Z

mllib/src/main/scala/org/apache/spark/ml/param/params.scala

+    model match {
+      case m: HasInputCols with HasInputCol if m.isSet(m.inputCols) && m.isSet(m.inputCol) =>
+        raiseIncompatibleParamsException("inputCols", "inputCol")
+      case m: HasOutputCols with HasInputCol if m.isSet(m.outputCols) && m.isSet(m.inputCol) =>


This may not necessarily be an error for some classes, but we can keep it for now.

hhbyyh · 2017-12-19T20:45:45Z

mllib/src/main/scala/org/apache/spark/ml/param/params.scala

+   * this is not true, an `IllegalArgumentException` is raised.
+   * @param model
+   */
+  def assertColOrCols(model: Params): Unit = {


private[spark]

hhbyyh · 2017-12-19T20:45:56Z

mllib/src/main/scala/org/apache/spark/ml/param/params.scala

+    }
+  }
+
+  def raiseIncompatibleParamsException(paramName1: String, paramName2: String): Unit = {


private[spark]

hhbyyh · 2017-12-19T20:49:30Z

mllib/src/main/scala/org/apache/spark/ml/param/params.scala

+  }
+
+  def raiseIncompatibleParamsException(paramName1: String, paramName2: String): Unit = {
+    throw new IllegalArgumentException(s"Both `$paramName1` and `$paramName2` are set.")


The error message could be more straightforward, e.g. $paramName1 and $paramName2 cannot be set simultaneously.

SparkQA · 2017-12-19T21:40:22Z

Test build #85133 has finished for PR 19993 at commit 26fe05e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91 · 2017-12-19T22:08:16Z

@hhbyyh thanks for the comments. I have already addressed all of them, but I am holding off on pushing until the UT question is settled. Honestly, I think that checkParams is not the best place. Checking the exception requires setting the parameters and invoking transform, whereas checkParams is a very generic method and at the moment we set no parameters there. What do you think?

hhbyyh · 2017-12-20T19:28:28Z

To make it available for other classes, we need to support checking both fit and transform. That means we also need a sample input Dataset, so we may have to add the explicit test in each test suite. But we can still create a shared infrastructure function for the explicit tests to invoke.
E.g. we can create a function in object ParamsSuite or some other place:
checkMultiColumnParams(obj: Params, sampleData: Dataset[_])
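
A shared test helper along these lines might look roughly like the sketch below. It is written in plain Scala so it runs without Spark: `FakeStage`, its `transform`, and `rejectsTogether` are hypothetical names invented for illustration, and a `Seq` of pairs stands in for the sample Dataset. Taking a factory function is just one possible way to get a fresh instance for every check.

```scala
import scala.util.{Failure, Try}

object ExclusiveParamsTestSketch {

  // Hypothetical stand-in for a Params object: a bag of explicitly set values.
  final class FakeStage {
    private val values = scala.collection.mutable.Map.empty[String, Any]

    def set(name: String, value: Any): this.type = { values(name) = value; this }

    def isSet(name: String): Boolean = values.contains(name)

    // Stand-in for transform/fit: validation only happens once the stage is applied.
    def transform(data: Seq[(Double, Double)]): Seq[(Double, Double)] = {
      require(!(isSet("inputCol") && isSet("inputCols")),
        "inputCol and inputCols cannot be set simultaneously")
      data
    }
  }

  /** True iff a fresh stage with both params set fails with IllegalArgumentException. */
  def rejectsTogether(
      newStage: () => FakeStage,
      data: Seq[(Double, Double)],
      p1: (String, Any),
      p2: (String, Any)): Boolean = {
    // A new instance per check, so earlier checks cannot leak params into later ones.
    val stage = newStage().set(p1._1, p1._2).set(p2._1, p2._2)
    Try(stage.transform(data)) match {
      case Failure(_: IllegalArgumentException) => true
      case _ => false
    }
  }
}
```

A suite would call it once per param pair, for example `rejectsTogether(() => new FakeStage, data, "inputCol" -> "feature1", "inputCols" -> Array("feature1", "feature2"))`.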

mgaido91 · 2017-12-20T21:22:46Z

thanks @hhbyyh, I updated the PR according to your suggestion and previous comments.

SparkQA · 2017-12-20T22:18:08Z

Test build #85209 has finished for PR 19993 at commit 9872bfd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hhbyyh

Thanks. I added some suggestions for function names and the test utility.

hhbyyh · 2017-12-20T22:28:21Z

mllib/src/main/scala/org/apache/spark/ml/param/params.scala

+   * this is not true, an `IllegalArgumentException` is raised.
+   * @param model
+   */
+  private[spark] def assertColOrCols(model: Params): Unit = {


suggestion for function name:
assertColOrCols --> checkMultiColumnParams

hhbyyh · 2017-12-20T22:29:37Z

mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala

+   * @param paramsClass The Class to be checked
+   * @param spark A `SparkSession` instance to use
+   */
+  def checkMultiColumnParams(paramsClass: Class[_ <: Params], spark: SparkSession): Unit = {


suggestion for function name:
checkMultiColumnParams --> testMultiColumnParams

hhbyyh · 2017-12-20T22:34:10Z

mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala

+    // create fake input Dataset
+    val feature1 = Array(-1.0, 0.0, 1.0)
+    val feature2 = Array(1.0, 0.0, -1.0)
+    val df = feature1.zip(feature2).toSeq.toDF("feature1", "feature2")


I don't think the DataFrame here can be used for other transformers like StringIndexer.

How about adding a Dataset as a function parameter? And is it possible to use an instance obj: Params rather than paramsClass: Class[_ <: Params] as the parameter, just to be more flexible?

The reason why I created the dataframe inside the method was to control the names of the columns it has. Otherwise we can't ensure that those columns exist. I think that the type check is performed later, thus it is not a problem here. What do you think?

I preferred to use paramsClass: Class[_ <: Params] because I need a clean instance for each of the two checks. If an instance is passed, I cannot enforce that it is clean, i.e. that no parameters were already set, and I would also need to copy it to create new instances, since otherwise the second check would be influenced by the first one. What do you think?

Thanks.

We can send column names as parameter if necessary. We need to ensure the test utility can be used by most transformers with multiple column support.

> I think that the type check is performed later, thus it is not a problem here.

I don't quite get it: in either transform or fit the data types will be checked, and a mismatch will trigger an exception.

> I preferred to use paramsClass: Class[_ <: Params]

I'm only thinking about the case where the default constructor is not sufficient to create a working Estimator/Transformer. If that's not a concern, then reflection is OK.

SparkQA · 2017-12-21T11:40:14Z

Test build #85253 has finished for PR 19993 at commit d0b8d06.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-01-17T17:57:25Z

Test build #86282 has finished for PR 19993 at commit 25b9bd4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2018-01-19T19:02:42Z

Since RC1 for 2.3 failed, it'd be great to get this into 2.3. @mgaido91 do you mind if I send my comments along with a PR to update this PR of yours? I'm rushing because of the time pressure to get this into 2.3 (to avoid a behavior change between 2.3 and 2.4). Thanks in advance!

jkbradley

I'm just leaving early comments to note the main issue I see, especially since they are relevant to #20146 . I'll send a PR soon, later today.

jkbradley · 2018-01-19T19:53:47Z

mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala

  @Since("1.4.0")
  override def transformSchema(schema: StructType): StructType = {
-    if (isBucketizeMultipleColumns()) {
+    ParamValidators.checkExclusiveParams(this, "inputCol", "inputCols")


The problem with trying to use a general method like this is that it's hard to capture model-specific requirements. This currently misses checking to make sure that exactly one (not just <= 1) of each pair is available, plus that all of the single-column OR all of the multi-column Params are available. (The same issue occurs in #20146 ) It will also be hard to check these items and account for defaults.

I'd argue that it's not worth trying to use generic checking functions here.

My initial implementation (with @hhbyyh's comments) was more generic and checked what you describe. Afterwards, @MLnick and @viirya asked to switch to a more generic approach, which is the one you see now. I'm fine with either, but I think we need to choose one way and go in that direction; otherwise we just lose time.

I see. I'll see if I can come up with something which is generic but handles these other checks.
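
One way to picture a check that covers the requirements raised in this thread (exactly one of inputCol/inputCols, plus all params of the chosen group set and none of the other group) is the plain-Scala sketch below. It is not Spark's implementation: `SingleVsMultiColumnCheck`, its signature, and the messages are assumptions made for illustration, params are modeled as a set of explicitly set names, and defaults are deliberately ignored.

```scala
object SingleVsMultiColumnCheck {

  /**
   * Hypothetical helper (not Spark's implementation).
   * `isSet` holds the names of the params the user set explicitly.
   */
  def check(
      isSet: Set[String],
      singleColumnParams: Seq[String],
      multiColumnParams: Seq[String]): Unit = {

    // All `required` params must be set and none of the `excluded` ones may be.
    def checkGroup(required: Seq[String], excluded: Seq[String]): Unit = {
      val mustUnset = excluded.filter(isSet).mkString(", ")
      val mustSet = required.filterNot(isSet).mkString(", ")
      val problems = Seq(
        if (mustUnset.nonEmpty) Some("should not be set: " + mustUnset) else None,
        if (mustSet.nonEmpty) Some("must be set: " + mustSet) else None
      ).flatten
      if (problems.nonEmpty) throw new IllegalArgumentException(problems.mkString("; "))
    }

    // inputCol / inputCols decide which mode the caller intends.
    (isSet("inputCol"), isSet("inputCols")) match {
      case (true, true) =>
        throw new IllegalArgumentException("both inputCol and inputCols are set")
      case (true, false) => checkGroup(singleColumnParams, multiColumnParams)
      case (false, true) => checkGroup(multiColumnParams, singleColumnParams)
      case (false, false) =>
        throw new IllegalArgumentException("neither inputCol nor inputCols is set")
    }
  }
}
```

Because the inputCol/inputCols decision runs first in this sketch, a call that sets only splits and splitsArray fails with the "neither is set" message; reaching the group check requires setting inputCol or inputCols as well.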

jkbradley · 2018-01-19T19:54:12Z

mllib/src/main/scala/org/apache/spark/ml/param/params.scala

 @DeveloperApi
 object ParamValidators {

+  private val LOGGER = LoggerFactory.getLogger(ParamValidators.getClass)


Let's switch this to use the Logging trait, to match other MLlib patterns.

mgaido91 · 2018-01-19T20:01:16Z

@jkbradley sure no problem, let me know how I can help.

jkbradley · 2018-01-20T06:26:15Z

OK, sent: mgaido91#1

strengthened requirements about exclusive Params for single and multicolumn support

SparkQA · 2018-01-20T18:44:25Z

Test build #86416 has finished for PR 19993 at commit d9d25b0.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-01-21T11:37:50Z

Test build #86437 has finished for PR 19993 at commit 8c162a3.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-01-21T13:47:49Z

Test build #86439 has finished for PR 19993 at commit 7894609.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick · 2018-01-22T13:48:28Z

mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala

+      ("inputCols", Array("feature1", "feature2")))
+    ParamsSuite.testExclusiveParams(new Bucketizer, df, ("outputCol", "result1"),
+      ("outputCols", Array("result1", "result2")))
+    ParamsSuite.testExclusiveParams(new Bucketizer, df, ("splits", Array(-0.5, 0.0, 0.5)),


Only comment I have is that I believe this line is not testing what you may think.

As I read the checkSingleVsMultiColumnParams method, in this test case it will throw the error, not because both splits and splitsArray are set, but rather because both inputCol & inputCols are unset.

Actually it applies to the line above too.

@jkbradley

@MLnick actually it will fail for both reasons. We can add more test cases to check each of these two cases if you think it is needed.

MLnick · 2018-01-22T13:54:53Z

Overall looks good with @jkbradley's changes. I just left a comment on the param test cases as I think they're not quite complete

MLnick · 2018-01-22T18:58:20Z

Well yes, it would, but the method checks inputCols/inputCol first, so it will always fail for that reason here, i.e. we aren't actually testing the full code path.

On Mon, 22 Jan 2018 at 16:43, Marco Gaido wrote:

mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala

-  test("Both inputCol and inputCols are set") {
-    val bucket = new Bucketizer()
-      .setInputCol("feature1")
-      .setOutputCol("result")
-      .setSplits(Array(-0.5, 0.0, 0.5))
-      .setInputCols(Array("feature1", "feature2"))
-
-    // When both are set, we ignore `inputCols` and just map the column specified by `inputCol`.
-    assert(bucket.isBucketizeMultipleColumns() == false)
+  test("assert exception is thrown if both multi-column and single-column params are set") {
+    val df = Seq((0.5, 0.3), (0.5, -0.4)).toDF("feature1", "feature2")
+    ParamsSuite.testExclusiveParams(new Bucketizer, df, ("inputCol", "feature1"),
+      ("inputCols", Array("feature1", "feature2")))
+    ParamsSuite.testExclusiveParams(new Bucketizer, df, ("outputCol", "result1"),
+      ("outputCols", Array("result1", "result2")))
+    ParamsSuite.testExclusiveParams(new Bucketizer, df, ("splits", Array(-0.5, 0.0, 0.5)),

> @MLnick actually it will fail for both reasons. We can add more test cases to check each of these two cases if you think it is needed.

SparkQA · 2018-01-23T12:07:44Z

Test build #86525 has finished for PR 19993 at commit ebc6d16.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick

A couple minor comments - but that can be left for a clean up in a follow up PR if need be.

I'd prefer to merge this to branch-2.3.

MLnick · 2018-01-24T14:26:12Z

mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala

 import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.{Estimator, Transformer}
 import org.apache.spark.ml.linalg.{Vector, Vectors}
+import org.apache.spark.ml.param.shared.{HasInputCol, HasInputCols, HasOutputCol, HasOutputCols}


I don't think these are used any longer?

MLnick · 2018-01-24T14:31:26Z

mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala

+    ParamsSuite.testExclusiveParams(new Bucketizer, df, ("outputCol", "feature1"),
+      ("splits", Array(-0.5, 0.0, 0.5)))
+
+    // the following should fail because not all the params are set


Technically here we should probably also test the inputCols + outputCols case (i.e. that not setting splitsArray also throws an exception).

SparkQA · 2018-01-24T16:35:31Z

Test build #86589 has finished for PR 19993 at commit 2bc5cb4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

WeichenXu123 · 2018-01-24T18:25:53Z

+1 merge this to 2.3

MLnick

LGTM

…lti-column params are both set

## What changes were proposed in this pull request?

Currently there is a mixed situation when both single- and multi-column are supported. In some cases exceptions are thrown, in others only a warning log is emitted. In this discussion https://issues.apache.org/jira/browse/SPARK-8418?focusedCommentId=16275049&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16275049, the decision was to throw an exception.

The PR throws an exception in `Bucketizer`, instead of logging a warning.

## How was this patch tested?

modified UT

Author: Marco Gaido <marcogaido91@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #19993 from mgaido91/SPARK-22799.

(cherry picked from commit cd3956d)
Signed-off-by: Nick Pentreath <nickp@za.ibm.com>

MLnick · 2018-01-26T10:24:09Z

Merged to master / branch-2.3

MLnick · 2018-01-26T10:24:39Z

Thanks @mgaido91 and @jkbradley for working on this and others for review

[SPARK-22799][ML] Bucketizer should throw exception if single- and mu… …lti-column params are both set · 8f3581c
WeichenXu123 reviewed Dec 18, 2017
hhbyyh reviewed Dec 18, 2017
viirya reviewed Dec 19, 2017
address comments · bb0c0d2
fix doc · 9f56800
fix mima error · f593f5b
use ParamValidators · 2ecdc73
fix ut · 26fe05e
hhbyyh reviewed Dec 19, 2017
address review comments · 64634b5
add checkMultiColumnParams · 9872bfd
hhbyyh reviewed Dec 20, 2017
address review comments · d0b8d06
jkbradley reviewed Jan 19, 2018
strengthened requirements about exclusive Params for single and multi… … column support · 18bbf61
Merge pull request #1 from jkbradley/mgaido91-SPARK-22799 (strengthened requirements about exclusive Params for single and multicolumn support) · d9d25b0
fix style error · 8c162a3
fixt ut error · 7894609
MLnick mentioned this pull request Jan 22, 2018: [SPARK-22797][PySpark] Bucketizer support multi-column #19892 (Closed)
MLnick reviewed Jan 22, 2018
add all cases to UT · ebc6d16
MLnick approved these changes Jan 24, 2018
review comment · 2bc5cb4
MLnick approved these changes Jan 25, 2018
WeichenXu123 approved these changes Jan 25, 2018
asfgit closed this in cd3956d Jan 26, 2018
